After completing this lesson, you’ll be able to:
In many workplaces, data sources and their schemas can change without notice. Attributes and data types can be added, removed, and changed depending on who manages and uploads the data.
Imagine a scenario where a new hire doesn’t know the set schemas for current datasets and tries to update a dataset with some newly created data. Now, you either have a dataset with unwanted additional columns that duplicate the old columns, or you’re missing data in your existing dataset because the schemas didn’t match. This can be a huge headache for whoever manages the data in the long term, and nobody likes going back and correcting datasets.
Schema drift can even occur in organizations with stronger data governance practices. Consider what happens if you build an integration with one version of an external vendor API, but then the API is upgraded, and the results of your call change. You are now dealing with a new schema due to factors largely outside your control.
With the SchemaScanner transformer, we can be proactive by detecting and reacting to schema drift in our datasets. These two essential steps make it easy to adapt to changing schemas with FME.
For example, we have a dataset in our database with a table containing the following attributes:
Every few days, an employee uploads a new set of recordings to add on top of this data. The employee may not be familiar with the original dataset and has made some assumptions in naming the attributes. Their dataset contains the following attributes:
In this case, the schema has drifted, with the PathType attribute missing and the PathLength attribute having the incorrect name: Distance.
Before the user can upload the data to the existing dataset, we want to be able to detect changes in the schema so we can notify the user to change their input data schema.
This is where the SchemaScanner comes into use. It allows us to extract the schema of the incoming dataset and compare it against the known, existing dataset before allowing the data to be uploaded. If the schemas match, then all is good to go, and we can allow the data to flow freely into the base dataset. However, if the data’s schema doesn’t match the base dataset, we must react to what is happening here. In our case, we will create a Schema Change Report that we can then return to the user using email or an FME Server App.
The first step in our example is to detect the schema drift. We will use the SchemaScanner and the ChangeDetector transformer to detect changes between the original schema and the schema of the uploaded data.
First, create a new blank workspace in FME Workbench (2022.1 or later).
Add a reader to the workspace and set the Format to Esri Shapefile. Set Dataset to https://s3.amazonaws.com/FMEData/FMEData/Resources/DynamicWorkflows/Data/BikePaths_L.zip (or C:\FMEData\Resources\DynamicWorkflows\Data\BikePaths_L.shp).
Press OK to add the reader to the canvas.
Now that we have the original dataset in our workspace, we want to add the updated data to the workspace to compare against it. Again, add a reader to the canvas and select Esri Shapefile as the Format. Set Dataset to https://s3.amazonaws.com/FMEData/FMEData/Resources/DynamicWorkflows/Data/Updated_BikePaths_L.zip (or C:\FMEData\Resources\DynamicWorkflows\Data\Updated_BikePaths_L.shp). Because the dataset may change at any time, we want to ensure we can read the data dynamically. To do so, change Workflow Options to Single Merged Feature Type. Press OK to add the reader to the canvas.
Improving workspace performance tip: Since we only need to receive one feature from the reader to extract the dataset’s schema, we can limit the reader to reading a single feature. To do this, go to the Navigator Window, expand the reader’s parameters, click on Parameters > Features to Read, and set Max Features to Read to 1. You can do this for both the shapefile readers.
Next, we need to extract the schema from the source base dataset and prepare it for comparison against the newly uploaded dataset. To do this, add a SchemaScanner to our workspace and connect the BikePaths_L shapefile feature type to it.
The SchemaScanner won’t work with default settings because it includes unexposed attributes in the schema feature. We want to keep only the original exposed attributes and remove any additional attributes within the schema related to FME or the format itself.
Open the SchemaScanner parameters. For Ignore Attributes Containing, set the value to:
^multi|^fme|^shape
This will remove any attributes that contain "multi", "fme", or "shape" from being output as part of the scanned schema.
Press OK to accept the new parameters.
If we were to run the workspace, the output schemas from the SchemaScanner would not be visible in the Visual Preview table as they remain part of the attributes{} list; they would only be visible in the Feature Information window. Add a ListExploder transformer to your workspace and connect it to the SchemaScanner <Schema> port.
In the parameters, set the List Attribute to attribute{}.
We also want to extract the schema from the Updated dataset for comparison against the original dataset. Hold the CTRL key and select both the SchemaScanner and the ListExploder. Then, duplicate (CTRL+D or right-click > Duplicate) the transformers. Connect the Updated_BikePaths_L dataset to SchemaScanner_2. Ensure that the SchemaScanner_2 is connected to the ListExploder_2 transformer. Your workspace should now look like this:
Save and then run the workspace with feature caching enabled to see our output schema. We expect to see a list of the attribute names and their data types. Below you can see an example of the original dataset’s ListExploder output.
Now that we’ve extracted the schemas from our datasets into a set of attributes, we can use the ChangeDetector transformer to compare the schemas.
Add a ChangeDetector transformer to your workspace. Connect the ListExploder for the original dataset to the Original input port. Connect the ListExploder_2 for the updated dataset to the Revised input port. Open the ChangeDetector parameters. Under Check Attributes, click the ellipsis next to Selected Attributes and select name.
Press OK twice to accept the transformer parameters.
Run the workspace. You should see that the schema has changed. One new attribute has been Inserted, two attributes have been Deleted, and two remain Unchanged.
This workspace can now detect schema drift. In the next exercise, we'll react to this schema drift by producing a report for the user, notifying them that schema drift has been detected.